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Abstract 

Background: With the advent of next-generation sequencers, the growing demands to map short DNA sequences 
to a genome have promoted the development of fast algorithms and tools. The tools commonly used today are 
based on either a hash table or the suffix array/Burrow-Wheeler transform. These algorithms are the best suited to 
finding the genome position of exactly matching short reads. However, they have limited capacity to handle the 
mismatches. To find n-mismatches, they requires 0(2") times the computation time of exact matches. Therefore, 
acceleration techniques are required. 

Results: We propose a hash-based method for genome mapping that reduces the number of hash references for 
finding mismatches without increasing the size of the hash table. The method regards DNA subsequences as 
words on Galois extension field GF{2^) and each word is encoded to a code word of a perfect Hamming code. The 
perfect Hamming code defines equivalence classes of DNA subsequences. Each equivalence class includes 
subsequence whose corresponding words on GF{2^) are encoded to a corresponding code word. The code word is 
used as a hash key to store these subsequences in a hash table. Specifically, it reduces by about 70% the number 
of hash keys necessary for searching the genome positions of all 2-mismatches of 21 -base-long DNA subsequence. 

Conclusions: The paper shows perfect hamming code can reduce the number of hash references for hash-based 
genome mapping. As the computation time to calculate code words is far shorter than a hash reference, our 
method is effective to reduce the computation time to map short DNA sequences to genome. The amount of 
data that DNA sequencers generate continues to increase and more accurate genome mappings are required. 
Thus our method will be a key technology to develop faster genome mapping software. 



Background 

The history of bioinformatics has been dominated by 
the search for faster sequence alignment methods. 
Beginning with dynamic programming for protein and 
genome sequence alignment, many algorithms have 
been proposed. Hash tables are used in the series of 
FASTA programs [1], which calculate approximate 
alignments in shorter times than dynamic programming 
can. BLAST tools, using automatons in their algorithms. 
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are the most famous and most used alignment tools [2]. 
These tools are fast enough to align expression sequence 
tags generated by capillary electrophoresis-based DNA 
sequencers to target genomes. 

The emergence of next-generation sequencing tech- 
nology has changed the demands for alignment speed. A 
so-called next-generation sequencer can read far more 
base pairs than a conventional sequencer: more than 
two billion short DNA sequences in a single run. For 
such a large number of the sequences, BLAST tools are 
too slow to map the sequences to target genomes. 
Therefore, researchers have called for a faster approach 
that is focused on mapping short fragments. 
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To meet this demand, more than 25 software pro- 
grams designed for mapping short DNA sequences onto 
genomes have been developed. These are classified into 
two categories according to their algorithms, which are 
either hash-based or suffix array/ Burrow- Wheeler tran- 
sition (BWT)-based [3], MAQ [4] and SOAPvl [5] are 
two hash-based algorithms. The former indexes short 
DNA sequences and the latter indexes genome 
sequences. mrsFAST is the one of the newest algorithms 
that indexes both the short DNA sequences and the 
genome sequences [6]. The first genome-mapping algo- 
rithm based on suffix arrays was proposed in 2002 [7] 
and implemented as vmatch. Currently, a BWT-based 
algorithm is the fastest and is used in four tools: bowtie 
[8], BWA [9], SOAPv2 [10], and segemehl [11]. 

These algorithms are effective for mapping short 
sequences to genome positions of perfect matches and 
one-base mismatches, but are inefficient for mapping to 
positions for two or more-base mismatches. In general, 
they require 0(2^) computation time to calculate «-base 
mismatches. But genome mapping that allows only one- 
base mismatches is inadequate. In practice, 20 to 40% of 
short sequences cannot be mapped to the genome. 
Therefore, "wet" researchers require faster algorithms 
for mapping short sequences to genome positions with 
two or more-base mismatches. In this paper, we propose 
a method that can accelerate hash-based genome map- 
ping by reducing the number of hash references without 
increasing the size of the hash table. 

In the proposed method, DNA subsequences are 
divided into equivalence classes by using a perfect Ham- 
ming code. Each equivalence class includes subse- 
quences whose corresponding words on GF{2^) are 
encoded to the corresponding code word of the perfect 
Hamming code. The code word is used as a hash key to 
store these subsequences in a hash table. A perfect 
Hamming code is a special case of a Hamming code, 
known in the field of coding theory [12], that satisfies 
the Hamming bound with equality. Perfect Hamming 
codes have been applied to n-gram analysis of genome 
sequence [13] and multiple alignment [14]. 

Hash-based genome-mapping algorithms use hash 
tables. A hash table is an array indexed by hash values 
generated from hash keys. Thus, a hash table is an 
implementation of an associative array. There are two 
methods for mapping short reads onto genomes using 
hash tables. One is to store subsequences of the genome 
and their positions in a hash table and the other is to 
store subsequences of short reads. As there is no essen- 
tial difference between their hash usages, we use the for- 
mer method for the following explanation. 

The hash-based methods prepare a hash table whose 
keys and values represent subsequences of length / cut 
from a target genome and the subsequence genome 



positions, respectively. To map a short sequence to the 
genome, a sequence is cut into lengths /, and these are 
used as keys to refer to the hash table. The methods can 
find the genome position of a perfect match if the hash 
table returns at least one entry. In general, when the 
lengths of short sequences are longer than the length of 
the hash key, the methods expand the area of alignment 
from the genome position. The differences between 
methods are how and when they refer to the hash table. 

There are three methods to find the n-mismatch gen- 
ome positions of a subsequence of length / with the 
hash table. 

1. Refer to all n-mismatch subsequences. 

Prepare a hash table whose key length is /, and use the 
subsequence and its n-mismatch subsequences as keys 

to refer to the table. It requires " /C„3/ hash refer- 
ences to find all the n-mismatch genome positions. 

2. Store n-mismatch positions in the hash table. 

For each position of the subsequence of the genome, 

store the position " /C„3/ times. The hash keys 

are the subsequence and its n-mismatch subsequences. 

3. Use pigeonhole principle; combine hash table and 
another method. 

Generate a hash table whose key length is il/nj. After 
getting the perfect-match genome position of length [// 
«j by referring to the hash table, find n-mismatch 
sequences by another method, such as dynamic pro- 
gramming or BWT. 

Figure 1 shows examples of the three methods. 

These methods are effective when / is small and n 
equals 1. But they are difficult to use when « is 2 or more 
because the number of 2-mismatch sequences of length / 
is /(/ - l)/2 9 and that of 3-mismatches is /(/ - 1)(/ - 
2)/2 9. The first and second methods require too many 
hash references and too big a hash table, respectively. 
The third method is the best, but as n becomes larger, 
the ability to narrow the genome position down becomes 
weaker, and so the load of the post process to find n-mis- 
match sequences increases. To overcome these difficul- 
ties and improve the effectiveness of using hash tables for 
genome mapping, technical breakthroughs are needed. 

We propose a method to reduce the number of hash 
references to find the genome positions of 2 or more mis- 
matches without enlarging the size of the hash table. To 
realize the method, 4-ary perfect Hamming code is used. 

Results 

Perfect Hamming codes as hash keys 
Idea 

We first describe the main idea of the proposed method. 
We define a graph whose nodes are all the subsequences 
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Figure 1 Hash tables for three methods. Three methods to find genome positions of 1 -mismatch from the subsequence AAGT. Genome 
position 1000 is ACGT, which is the 1 -mismatch of the subsequence. The first method refers to the hash table 16 times. The second method 
refers to the table just once, but the table is 16-fold larger. The third method refers to the table three times. After getting position 1002 from 
the hash table, the method elongates the alignment toward the front of the sequence. 



of length / and edges are between pairs of subsequences 
of one nucleotide difference. The graph has 4^ nodes 
and each node has 31 edges. Assume we divide the 4^ 
nodes of the graph into 4^(3/ + 1) equivalence classes 
such that each class has a center node and 31 adjacent 
nodes. Then, we use the center node as the hash key to 
store the genome positions of 3/ -h 1 subsequences in a 
hash table. For example, Figure 2 shows a part of the 
graph of length 5. The node "AAA A A" has 15 adjacent 
nodes. These comprise an equivalence class with subse- 
quence"AAAAA" at its center. To store "ACAAA", 
which belongs to the equivalence class, the center word 
"AAAAA" is used as a hash key. 

The features of this hash table are as follows: (1) The 
number of entries in the hash table does not increase 
because each subsequence is stored only once. (2) Using 
this hash table, we can reduce the number of hash refer- 
ences to find the genome positions of subsequences of 1 
or more-mismatches. 

We explain the concept how to reduce the number of 
hash references to find 1- mismatches by using an exam- 
ple. Let the length of subsequence be 5, the hash table 
be as described above, and 5 be a subsequence for which 
we want to find the entries of 1 -mismatch. There are 15 
1-mismatch subsequences to be referred to. When s is 



the center subsequence of the equivalence class, such as 
"AAAAA" in Figure 2, using s as the hash key, one can 
find all the genome positions of 1-mismatch subse- 
quences. When 5 is not the center subsequence of its 
equivalence class, seven hash keys are needed. One hash 
key is the center subsequence of its equivalence class. 
This leads to 3 of the 15 1-mismatch subsequences. The 



I GAAAAl 




ACAAAl 

^ AGAAA1 



AAGAAl 



Figure 2 Relationship among 16 subsequences. Graphical 
depiction of subsequence "AAAAA" and 15 adjacent subsequences. 
Each node describes a subsequence and each edge indicates that 
the terminal nodes are of one nucleotide difference. The 15 
subsequences are divided into five groups according to the position 
of the different nucleotide. 
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rest of 12 subsequences are members of six different 
equivalence classes, and so the center subsequence of 
these classes are used as the other six keys. Figure 3 
shows the equivalence class that 5="CAAAA" belongs to 
and an equivalence class containing the two adjacent 
subsequences "CAAAT" and "CAAGA". Because the 
proportion of center words (nodes) to total words is 1/ 
16, the expected number of hash references is 1/16 1 
+ 15/16 7 = 6.625, which is 41.4% of the number of 
exact and 1 -mismatch subsequences. 

The requirements for the establishment of the equiva- 
lence classes need to be determined. At a minimum, the 
length of the subsequence / must satisfy the condition: 



equivalence classes = 



words 



words in a equivalence class 



is a natural number 



(1) 



3/ + 1 



This shows that 3/ + 1 must be a power of 4. For 
example, 1 = 5 and 21 satisfies the condition. 

It is not clear that the above equation is a sufficient 
condition for constructing equivalence classes. Even if it 
is, two problems still remain; how to construct the 
equivalence classes and how to calculate the center 
words from a given subsequence. Perfect Hamming 
codes provide solutions to both these problems. 
Perfect Hamming code 

A perfect Hamming code (PHC) is a Hamming code 
that satisfies the equation of the Hamming bound. 



|C|<- 



(2) 



where C is a set of q-divy block code of length n, d is 
the minimum Hamming distance between code words, 
and t = [^J. In the PHC method, all the received 
words are classified into a code word or a 1-bit error. In 




Center words 



Figure 3 1 -mismatch sequences of a non-code word sequence, 
'XAAAA" Idea behind finding 1-mismatcli sequences of tine 
sequence "CAAAA". Tine two circles indicate the equivalence classes. 
The sequence "CAAAA" belongs to an equivalence class whose 
center is "AAAAA" that holds 3 of the 15 1 -mismatch sequences. 
Two of the other sequences, "CAAAT" and "CAAGA", belong to a 
equivalence class whose center is "CAAGT". 



other words, all the words are decoded to code words 
whose Hamming distance is 0 or 1. 

The condition for a q-divy {n, /:)-Hamming code be a 
perfect code is n = q^l{q - 1). In the case of a 4-ary 
code, (5,3)-Hamming code and (21,18)-Hamming code 
are perfect. Their parity-check matrixes are shown in 
equations (3) and (4), respectively. 



H(5, 3) = 



10 11 1 
0 1 1 a 



1 0 0 0 1 1 1 
0 10 10 11 
0 0 1110 1 



110 0 1 



(3) 



(4) 



where (0, 1, a, a^) are the elements in the Galois field 
GF(2^). To get a code word x from received word z, the 
syndrome s = [z - x)H^ is used to determine the error 
vector e = {z - x) [12]. If the syndrome s is not zero, 
the column of H that is equal to or constant factor of s 
indicates the error position of z. The addition and mul- 
tiplication tables are shown in Table 1. 

The code word is calculated from a received word as 
follows. 

1. Calculate the syndrome s, 

2. If the syndrome s is zero, then the recieved word is 
the code word. 

3. Find a column c of parity-check matrix that is a 
constant factor t of the syndrome. 

4. Subtract t from the column c of the received word 
and the result is the code word. 

For example, assume the word z = [aaaa^O) is 
received. The code word of z is calculated as follows. 
The syndrome s oi z is: 



5 = zH^ = (aaaa^O) 



■iah). 



1 0 

0 1 

1 1 

1 a 
1 



As is equal to x (1 a), which is times the 
fourth column of //, subtract (OOOa^O) from z: 

z - (000a ^0) = [aaaa^O) - (000a ^0) = (aaaOO) 
The code word of z is (aaaOO). 

The (n, /c) -Hamming codes are composed of the infor- 
mation digits and the check digits. The information digits 
are k arbitrary digits of the code and the other n - k digits 
are the check digits. The generator matrix G can repro- 
duce check digits from information digits. Therefore, code 
words can be uniquely represented by the k digits. In this 
paper, we call these representations short codes. 
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Table 1 Addition and multiplication on GF(2^) 
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PHC and DNA subsequence 

DNA sequences are composed of four nucleotides, ade- 
nine, cytosine, guanine and thymine. Let these corre- 
spond one-to-one to the elements of Galois field GF{2^), 
Then, DNA sequences correspond to words on the 
Galois field. Without loss of generality, let (A,C,G,T) 
correspond to (0,1, a, a^). The sequence "GGGTA" is 
expressed as the word (aaaa^O), and the word (10000) 
represents the DNA sequence "CAAAA". 

This correspondence relationship and the PHC 
enables us to build the equivalence classes described in 
Section Idea, Each equivalence class is composed of a 
DNA subsequence that corresponds to a PHC code 
word and DNA subsequences whose corresponding 
words are error-corrected to the code word. Figure 2 
shows an equivalence class. The DNA subsequence 
"AAAAA" corresponds to word (00000) on GF{2^) and 
is a code word of 4-ary (5-3) -PHC. From the properties 
of PHC, All the words whose Hamming distances from 
the code word (00000) are 1 are error-corrected to the 
code word, and they are adjacent nodes of "AAAAA". 
Additional File 1 shows correspondence table of 5-mer 
subsequences and code words of 4-ary (5,3)-PHC. In the 
following, we regard DNA subsequences and their cor- 
responding codes on the Galois field as equivalent (i.e., 
aliases). 

Algorithms 

We propose a hash table for genome mapping whose 
hash keys are code words of PHC. Then we show its 
use and efficiency in finding genome positions of n-mis- 
matches and n-gaps. Following is a description of the 
notation used in this section. 

5 : a DNA sequence or its equivalent code on G(2^) 
c[s) : code word of 5 of PHC 
i{s) : information digits of c{s) 

£(5) = {X|C(X) = C(5)} 

: a set of subsequences that belong to the same equivalence class as s 
^ni^i'^i)- Hamming distance between 5 1 and 53 
Niis) = {x\diix, s) = i} 

: the set of subsequences whose Hamming distance from s is i 



Preparing the hash table 

There are two ways to construct hash tables for map- 
ping short DNA sequences onto a genome. One uses 
subsequences of the genome as hash keys to store their 
genome positions in a hash table. And another uses sub- 
sequences of short DNA sequence as the hash key. 
Because both of these use DNA subsequences as hash 
keys, our method can be applied to either. In the follow- 
ing, we use the former in the explanation. 

The hash table of our method uses the representative 
subsequence of the equivalence classes as hash keys. 
The representative is the code word on PHC. Without 
loss of generality, the information digits of the code 
words can be used as the hash key. Given an (n, k) - 
PHC on G(2^), let 5 be a set of subsequences of the gen- 
ome with length I = n. The genome positions of a sub- 
sequence 5 G 5' are stored in the hash table along with a 
hash key c{s) or its information digits i{c{s)). Figure 4 
shows the use of hash tables. 
Searching for n-mismatches 

In this section we describe how to find genome posi- 
tions of 1- and 2-mismatch subsequences of a given 
subsequence s given the hash table prepared as 
described in section Preparing the hash table. The effi- 
ciency of the method is also described. 

Let 5 be a DNA subsequence. The set of hash keys K 
required to refer to all the entries of perfect match and 
1- to n-mismatches is naturally expressed as follows. 



n 

K„{s) = [j{c{x)\xBN,{s)} 



1=0 



genome 



|A|C| A| A| A| I A| A| A|T|A| 



position:! 100 
Key Value 



position:2000 



Key Value 



kkkkk 


1100 




kkk 


1100 


kkkkk 


2000 


or 


kkk 


2000 













(short key hash table) 

Figure 4 Proposed hash table. Entries of proposed hash tables. 
Subsequence "AAAAA" is used as the l<ey for storing the genome 
positions of "ACAAA" and "AAATA" because"AAAAA" is the center of 
the equivalence class that "ACAA" and"AAAT" belong to. The left 
hash table uses the center subsequence itself as a hash key. The 
right one uses the short code of the center subsequence as a hash 
key, where the short code is the information digits of the code. 
Short codes are described in Section Perfect Homming Code. 
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The number of keys |/<r„(5)| is less than the number of 

subsequences to be referred to, ^ N (s)- Specifi- 

cally, when the length of 5 is 5 and 21, the expected 
number of keys for 1-mismatch Ki{s) are 6.625 and 
30.25, respectively. As the number of subsequences of 
perfect and 1-mismatch are 1 + 5C1 x 3, the method 
reduces the number of hash keys to 41.4% (= 6.625/16) 
and 47.7% (= 30.25/64), respectively. We summarize 
these values in the "ratio" column of Table 2. In the fol- 
lowing, we analyze the properties of the hash keys Ki{s) 
and /<2(5). 

First, we analyze Ki{s) for s of length 5. The subse- 
quence s is classified into two cases according to 
whether 5 is a code word on PHC. 

Case 1: 5 is a code word. As s is the code word, all the 
1-mismatch words of s are decoded to s by PHC. That 
is to say, they belong to the same equivalence class E{s). 
Therefore, s is used as a hash key and it can refer to all 
the 1-mismatch subsequences. 

Case 2: s is not a code word. Figure 5 shows the fif- 
teen 1-mismatch subsequences. As 5 is not a code word, 
the equivalence class E{s) has three 1-mismatch subse- 
quence of 5. One of these three is the code word c{s). 
These three differ from 5 at the same digit. Assume s= 
"AAAAG". Then, the code word c{s) is "AAAAA", and 
The words "AAAAC" and "AAAAT" belong to same 
equivalence class, E("AAAAG"). These differ at the fifth 
digit. The hash key c{s) can refer to four sequences at 
once. 

The rest of 12 words belong to six equivalence classes. 
Assume that word t differs at ;-th digit, t is not a code 
word because d^it, c{s)) = 2 and the distance between 
code words must be more than 3. The code word c{t) 
and t differ at the /c-th digit, where k /, 7. There is a 
word u that differs from 5 at the /c-th digit and c{t) at 
the ;-th digit. Because c[t) belongs to the equivalence 




Figure 5 Fifteen 1-mismatch subsequences of s when s is not a 
code word. Nodes represent subsequences and edges indicate a 
Hamming distance between two nodes of 1, namely, tine relation of 
1-mismatch. Edge labels indicate the position of the different digit 
(nucleotide), s belongs to a equivalence class of c(s), which also 
contains three other words. The other 12 words belong to six 
equivalence classes. 

V ) 

class E(t), use c{t) as a hash key and two words t and u 
can be referred to. Finally, the number of keys K\{s) to 
refer to all the 1-mismatch subsequences is seven. 

Because the proportions of Case 1 and Case 2 are 
respectively 1/16 and 15/16, the expected number of 
keys in K^(s) is 6.625 (1 1/16 + 7 15/16). 

Next, we show an algorithm to calculate the set of 
hash keys K\{s), 

Input : s 

Output : Ki{s) : a set of hash keys 

1. Ki{s) 91 c (s) 

2. 5 := c (5) and return {Ki (s)) 

3. for t in Ni{s) add c{t) to 7^1(5) 

4. return {Ki{s)) 

The algorithm calculates code words 16 times when s 
is not a code word, where the number of the hash keys 
is seven. There appears to be redundancy. There are 
various ways of reducing the computation time of the 
code words, but it is better not to use complicated algo- 
rithms. The calculation of code words is fast because 



Table 2 Summary of our methods for lengths 5, 21, and 10 to refer to 1- 


and 2-mlsmatch and 1- and 2-gap sequences 


length 


condition 


#keys 


#words 


ratio 


f{s, K) when s = c(s) 


f{s, /<) when c 7i c(s) 


5 


1-mismatch 


6.625 


16 


41.4% 


1 + 15x 


1 + 15x + 42x^ + 54x^ 




2-mismatches 


27.25 


106 


25.7% 


1 + 15 + 90x^ + 2]0x^ + 180x^ 


1 + 15 + 90x^ + 170x^ + 156x^ 




1-gap 


3.25 


4 


81.3% 


4 + 12x 


4 + 60x 




2-gaps 


10 


16 


62.5% 


16 + 36x + 108x^ 


_ 


21 


1-mismatch 


30.53 


64 


47.7% 


1 + 63x 


1 + 63x + 210x^ + 1710x^ 




2-mismatches 


611.31 


1954 


31.3% 


1 + 63x + 1890x^ + 441 Ox^ + 34020x^ 1 + 63x + 1890x^ + 5650x^ + 31500x^ 




1-gap 


3.81 


4 


95.3% 


4 + 60x 


4 + 252x 




2-gaps 


13.87 


16 


86.7% 


16 + 84x + 540x^ 


16 + 48x + 960x^ 


10: Serialize 


1-mismatch 


12.25 


31 


39.5% 


1 + 30x + 225x^ 


1 + 30x + 170x^ + 538x^ + 1089x^ + 1620x^ 


10: Parallelize 


1-mismatch 


13.25 


31 


44.1% 


1 + 30x 


1 + 30x + 84x^ + 108x^ 



*^ :s always includes one code word. neither the first half nor the second half are code words. The reference formula when one of the two halves is a code 
word Is 1 + 30x^ + 267x^ + 684x^ + 81 Ox'*. neither the first half or second half are code words. The reference formula when one of the two halves Is a code 
word Is 1 + 30x^ + 42x^ + 54x^. 
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the operations + and x on G(2^) can be calculated as 
binary operations. 

By using the set of hash keys Ki{s), not only the 
entries of s and all the 1-mismatch subsequences are 
referred to, but also the other entries are. For example, 
when 5 is not a code word, 12 subsequences of 2-mis- 
match belongs to the same equivalence class E{s), 
Therefore, the hash key c {s) refers to these subse- 
quences. To analyze the properties of these subse- 
quences, we define a reference formula: 

f{s,K) = ^ax\ 

where 5 is a subsequence, K is a set of hash keys, and 
the coefficient a represents the number of subse- 
quences that are referred to by K and whose distance 
from s is b. 

For Case 1, because s is the code word, all the 1 -mis- 
match subsequences belong to E{s), Therefore, the refer- 
ence formula for Case 1 is: 

/(5,Ki(s)) = l + 15x. 

For Case 2, there are 7 hash keys in Ki{s), Because the 
equivalence class of c {s) has 12 2-mismatch subse- 
quences, the reference formula of c (s) is: 

/(5,{c(5)}) = l + 3x + 12x^. 

Each of the equivalence classes of the other hash keys 
has two 1-mismatch sequences, five 2-mismatch 
sequences and nine 3-mismatch sequences. The refer- 
ence formula of Ki{s) - {c{s)} is: 

f{s, K^{s) - {c{s) } = (2x^ + 5x^ + 9x^) X 6 = 12x + 30x^ + 54x^). 

Finally, the reference formula for the Case 2 is: 

/(5, Ki(s)) = 1 + 15x + 42x^ + 54x^ 

The reference formula shows the proposed method 
searches many 2- and 3-mismatch sequences. We dis- 
cuss this feature in Section Discussion, 

The above algorithm and analysis can be applied to 
the word length 21. The numbers of hash keys are 1 
and 31 for Case 1 and Case 2, respectively. Using the 
rate of occurrences of Case 1 and Case 2, 1/64 and 63/ 
64, respectively, the expected number of hash keys is 
30.53. The reference formulas are shows in Table 2. 

To refer to all the entries of 2-mismatches, our 
method requires 27.25 hash keys, which is 25.5% of the 
number of subsequences with 2 or fewer mismatches, 
when the length of subsequences is 5. Figure 6 shows 
the subsequence s and its surroundings when 5 is a code 




Figure 6 Two-mismatch subsequences of s when s is a code 

word. Dark-gray nodes are 2-mismatches whose Hamming distance 

from s is 2. Each belongs to an equivalence class that has other two 

2-mismatche subsequences. 
L J 



word. The number of subsequences of exact 2-mis- 
matches |A/^2(5)| is 90, and the elements of |A/'2(5)| are 
neighbors of Ni{s), Each n g A/2 (5) has two neighbors of 
Ni{s) and each n belongs to an equivalence class that 
contains other two elements of N2{s), In other words, 30 
equivalence classes are required to cover N2{s), In total, 
the number of required keys /<2(5) is 31. The reference 
formula is: 

/(s,K) = l + 15x + 30x(3x^ + 7x^ +6x^) 
= 1 + 15x + 90x^ + 210x^ + 180x^. 

In Case 2, s is not a code word and Figure 7 shows 2- 
mismatch sequences of s. The equivalence classes 2-mis- 
match subsequences belong to are classified into four 
types. 

Type 1: 2-mismatch subssequences belong to this type 
are neighbors of c{s). 

Type 2: for some t such that dnit, s) = 1 and c{t) ^ c 
(5), 2-mismatch subsequences are neighbors of t belong- 
ing to c{t). 

Type 3: for some t such that dnit, 5) = 1 and c{t) ^ c 
(5), they are neighbors of t not belonging to c{t). 

Type 4: for some u such that dniuy 5) = 1 and c{u) = c 
(5), they are neighbors of u not belonging to c{s). 

Types 1 and 2 are included in Ki{s) and the numbers 
of subsequences are 12 and 30, respectively. The equiva- 
lence classes of Type 3 are the right-most circles in Fig- 
ure 7. Similar to subsequences of 2-mismatches in 
Figure 6, three 2-mismatch subsequences belong to each 
equivalence class. The number of equivalence classes is 
eight and they includes 24 2-mismatch subsequences. 
The bottom-left circle in Figure 7 represents Type 4. 
The relation between of Types 4 and 1 looks similar to 
that between Types 2 and 1. In fact. Type 4 is a union 
of Ki{u) - c{s) for over each possible u. As the size of u 
is 2, Type 4 has 12 equivalence classes. In total, the 
number of hash keys to refer to 2-mismatch subse- 
quences is 27 (= 1 + 6 + 8 + 12). The reference formula 
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is as follows: 

f{s, K^is)) = (1 + 3x + 12x^) + 6{2x + 5x^ + 9x^) + 8{3x^ + 7x^ + 6x^) + 12{2x^ + 5x^ + 9x*) 
= 1 + 15x + 90x^ + 170x^ + 156x'* 

In the same manner, we can analyze the hash keys K2 
(s) for the subsequence of length of 21. In this case, the 
reference formula when 5 = c{s) becomes: 

/(s, ^2 W) = 1 + 63x + 630 X (3 + 7x + 54x^) 

= 1 + 63x + 1890x^ + 4410x^ + 34020x^, 

and the formula when s c{s) is: 

/(5,K2W) = l + 3x + 60x^ :Typel 

+30(2x + 5x^ +57x^) : Type2 

+520(3x^ +7x^ + 54x^) : Type3 

+60(2x^ +5x^ +57x'^) : Type4 
= 1 + 63x + 1890x^ + 5650x^ + 31500x*. 

Search n-gaps 

To align DNA subsequence and a genome, there are 
three types of gaps. These are gaps in short DNA 
sequence, gaps in genome sequence, and gaps in both. 
Our method can reduce the number of hash keys to 
refer to gaps in short DNA sequences. Given a subse- 
quence with gaps 5, hash keys to refer to the genome 
positions are a set of code words of subsequences which 
with the gaps of 5 are substituted with nucleotides. The 
expected numbers of hash keys are 3.25 and 3.84 when 
the length of a subsequence with one gap is 5 and 21 
respectively. 



Let 5 be a subsequence with one-gap, such as "AA- 
AA", and 5' be a set of subsequences for which a gap in 
s is substituted with a nucleotide, in this case the set 
comprising "AAAAA", "AACAA", "AAGAA", and 
"AATAA". To find the genome position that matches 5, 
we need to refer to entries that correspond to S, 

When a subsequence t g 5 is a code word, all the sub- 
sequences in S belong to the same equivalence class. 
Therefore, by using c{t) as a hash key, all the entries 
correspond to gapped subsequence 5 can be referred to 
from. In this case, the reference formula becomes 4 + 
12;\:, where the index number of x represents the mini- 
mum distance from the four subsequences in S, 

If no ^ G 5' is a code word, the four subsequences 
belong to different equivalence classes. Therefore, four 
hash keys are required and the reference formula is 4 + 
60x, The proportion for which one member of 5 is a 
code word is of words in a equivalence class = 4/ 
16 = 1/4, and so the expected number of hash keys is 
3i(=lxl/4 + 4x4/3). 

In the same way, when the length is 21, the expected 
number of hash keys is 3^(= 1 x 1 / 16 + 4 * 15 / 16) . 
The reference formulas are shown in Table 2. 

Next, we consider a subsequence with two gaps. Let s 
be a subsequence of length 5 with two gaps such as "A- 
A-A" and S be the set of 16 sequences for which the 
gaps in s are replaced with nucleotides. The number of 
information digits in (5, 3)-PHC is 3, and so one of 16 
words in 5' is a code word t. The code word t is the 
only code word in S because the maximum Hamming 
distance among words in S is 2, which equals the 
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number of gaps, and the minimum Hamming distance 
between code words is 3. The equivalence class E{t) 
includes seven subsequences of S; let U be the remain- 
ing subsequences {U = S - Ni{t)), Because u g LI is not 
a code word and dniu, t) = 2, the distance between t 
and the code word c{u) is 3 (<i//(c(w),^) = 3). This implies 
that c{u) and t differ at a non-gapped position in t. 
Therefore, for all r/, v g U, u ^ v and d{c{u), v) = d{Uy v) 
+ 1 > 2, Thus, all the subsequences in U belong to dif- 
ferent equivalence classes and nine (=|^|) hash keys are 
required to refer to 2-gap subsequences. 

For example, if 5="A-A-A", S includes a code word t 
= 'KKKKK" and t can refer to 7 sequences: ''KKKKA!\ 
"ACAAA", "AGAAA", "ATAAA", "AAACA", "AAAGA", 
and "AAATA". The set of remaining subsequences are 
U = {"ACACA", "ACAGA", "ACATA", "AGACA", 
"AGAGA", "AGATA", "ATACA", "ATAGA", "ATAGA" 
}. Let u L U = "ACACA". The code cord c{u) is 
"ACACC" and Hamming distance from the rest of U 
are two or three. Therefore, all subsequences in U 
belong to different equivalence classes. 
Length of subsequence 

The code length of the PHC is restricted to 5 or 21 in 
practice. This is inconvenient. Therefore, we next 
explain ways to elongate the code length. There are four 
ways to elongate the code length. The first way is to 
simply add nucleotides before or/and after the code 
words. Two other ways are serialization and paralleliza- 
tion of PHCs. The former method serializes more than 
two PHCs and serialized code words are used as the 
hash key. The latter one uses more than two hash tables 
for the parallelization and is a way to utilize the pigeon- 
hole principle where pigeons are mismatches or gaps 
and holes are the regions without mismatches and gaps. 
The fourth way is a combination of the three. Figure 8 
shows the concepts behind each. In the following, we 
show a serialized code and a parallelized code, both of 
length 10. 

Let s = S1S2 be a sequence of length 10 and Si and S2 
be the subsequences of length 5. The hash key used in 
the serialization to store s is c{si)c{s2). In this case, each 
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Figure 8 Three ways to elongate the length of subsequences. 



equivalence class holds 256 subsequences. To refer to 
the entry of a sequence s = S1S2 and the 30 1 -mismatch 
sequences, the set of hash keys is: 

K^{s) = {tu\te K^{si),ue i^i(52)}, 

and the expected number of hash keys is 12.25. To 
prove this, we consider four cases: 

Case 1 Si and S2 are both code words. 

Case 2 5i is a code word, but S2 is not. 

Case 3 52 is a code word, but Si is not. 

Case 4 neither Si nor S2 is a code word. 

In Case 1, Use 51^2 as a hash key; this can refer to all 
the 1 -mismatches. The reference formula in this case is 
the square of the reference formula of length 5, (1 + 
I5x){l + 15x) = I + 30x + 225x^, 

In Case 2, we need to consider two subcases based on 
the position of the 1 -mismatch. When the position is in 
the first half of 5, the hash key 51^2 can refer to all of 
them. When the position is in the second half, the sec- 
ond half of the hash keys becomes one of the seven 
words in 7^1(52). Therefore, a set of hash keys is: 

K,{s) = {s,t\teK,{s2)}. 

Because S2 e Ki{s2)> the hash keys in the second sub- 
case include the hash keys in the first subcase. There- 
fore, the number of hash keys |/<ri(5)| is 7. The reference 
formula is: 

(1 + 15x)(l + 15x + 42x^ + 54x^) 
= 1 + 30x^ + 267x^ + 684x^ + 810x^. 
In Case 3, similar to in Case2, the set of hash keys is: 

K,{s) = {tS2\tsK,is,)}, 

and the reference formula is same as that of Case 2. 

In Case 4, the set of hash keys is the union of {sit\t g 
and {tS2\t G Ki{si)}, which correspond to 1-mis- 
match in the first half and 1-mismatch in the second 
half, respectively. Because both of these include c{si)c 
(52), the number of hash keys is 13 (= 7 + 7 - 1). The 
reference formula is: 

(1 + 3x + 15x^)(l + 15x + 42x^ + 54x^) x 2 - (1 + 3x + ISx^f 
= 1 + 30x + 170x^ + 538x^ + 1089x^ + 1620x^ 

The proportions of cases 1 through 4 of 1/256, 15/ 
256, 15/256, and 225/256, respectively, and so the 
expected number of hash keys is 12.25. 

The parallelization requires two hash tables and each 
subsequence is stored in both hash tables. Therefore, 
the total size of the hash tables is twice that of serializa- 
tion. The hash keys are c{si)s2: the first haf is PHC, and 
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Sic{s2): the second half is PHC. Consequently, two types 
of equivalence classes are used in the parallelization and 
each equivalence class holds 16 subsequences. The set 
of hash keys for 1 -mismatch sequences is: 

K,{s) = {tS2\te K,{s,)}u{s,tG K.is^)}, 

where each set corresponds to one of the two hash 
tables. The expectation number of keys is 13.25. 

Let us consider the four cases, which are the same as 
those for serialization. In case 1, use S1S2 as the hash key 
for two hash tables and all the entries of 1-mismatch 
are referred to. In this case, the reference formula is (1 
+ I5x) + (1 + I5x) -1 = 1 + 30x. As a subsequence S1S2 
is stored in both hash tables, one is subtracted in the 
formula. Though the number of hash keys appears to be 
one, it is used twice. Thus the number of hash keys \Ki 
{s)\ is two. 

In Case 2, the hash keys are: 

Kiis) = {s,S2}^{s,t\tGK,{s2)}. 

The total number hash keys is 8 and the reference for- 
mula is: 

(1 + 15x) + (1 + 15x + 42x^ + 54x^) 
= l + 30x + 42x^ + 54x^. 

Case 3 is similar to Case 2. In this case the hash keys 
are: 

K,{s) = {tS2\tGK,{s,)}^{s,S2} 

In Case 4, The hash keys when the first half includes 
the mismatch are {tS2\t g Ki{si)} and that for the second 
half are {sit\t e Ki{s2)}. The number of hash keys is 14 
and the reference formula is: 

(1 + 15x + 42x^ + 54x^) X 2 - 1 
= l + 30x + 84x^ +108x^. 

The proportions of the cases are same as for serializa- 
tion, and the expected number of hash keys is 13.25. 

Discussion 

To search genome positions of n-mismatches and «-gaps 
with our method. Table 2 shows it also searches some 
positions of «+a-mismatch. For example, our method 
searches not only 1 perfect match and 63 1 -mismatch 
subsequences, but also 210 2-mismatches and 1710 3- 
mismatches as byproducts 11.1%(= 210/1890) and 4.8%(= 
1710/35910) of all 2- and 3-mismatches, respectively. 
This proportion increases to 46.7%(= 42/90), 20% (= 54/ 
270) when 1 = 5, These byproducts are, in fact, effective. 
One reason for the current low mapping ratios from 



DNA sequencers of short reads to a genome is the small 
number of mismatches and gaps that the employed map- 
ping method can find. Therefore, increasing the numbers 
of mismatches and gaps will contribute to increasing the 
mapping ratio and subsequent biological analyses, even if 
the method is probabilistic. 

The increasing demand to map massive amounts of 
short DNA sequences to genomes is inevitable. Because 
the number of short sequences is enormous, it is diffi- 
cult to ensure finding all genome positions of 1-mis- 
matches in a practical computation time. Therefore, 
faster methods are required and the proposed method is 
a step in that direction. We have shown that the pro- 
posed method can reduce the number of keys necessary 
to find the genome positions of n-mismatches. The 
main idea behind the method is to classify the subse- 
quences into equivalence classes using PHC. Because 
equivalence classes contain multiple subsequences, our 
method can increase the density of the hash table over 
those using in the usual method. That is to say, our 
method can use longer subsequences. 

For example, the size of human genome is about 3G 
bases long. When this is stored it with subsequences of 
length 21 in usual way, the density of the hash table is 3 
X 10^^/4^^ « 0.07%. On the other hand, the hash table 
using our proposed method using (2 1,1 8) -PHC, the den- 
sity is 3 X 10^^/ [# of equivalence classes] = 3 x 10^^/4^^ 
« 4.7%. That is to say, our method can use longer sub- 
sequences. The length of subsequence is sensitive to the 
efficiency of the genome-mapping programs, and the 
longer the better, for a given density of hash table. 
Therefore, the proposed method has an advantage from 
this point of view. 

We consider the computation time for code words is 
far shorter than a hash reference when we describe the 
effectiveness of the proposed method. In practice, the 
calculation of the syndrome using the parity-check 
matrix of the Hamming code is very short, even if on 
GF(2^), and so it is easy to calculate the code word from 
a subsequence. Also, the calculation is small enough to 
be executed within a CPU cache. On the other hand, 
the size of hash table is larger than the size of CPU 
caches. Some exceeds the size of memory because the 
number of entries is almost equal to the length of the 
target genome. The hash reference is apparently slower 
than the calculation of the code word. Therefore, the 
advantage of reducing the hash references exceeds the 
disadvantage of additional tasks to calculate code words. 
With these advantages, our method will help to imple- 
ment faster genome mapping programs. 

Conclusions 

The paper shows perfect hamming code can reduce the 
number of hash references for hash-based genome 
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mapping. The method encodes subsequences to perfect 
hamming codes on GF{2^) and use them as hash l<eys. It 
can reduce by about 70% the number of hash I<eys 
necessary for searching the genome positions of all 2- 
mismatches of 21-base-long DNA subsequence. As the 
amount of data that DNA sequencers generates con- 
tinues to increase and more accurate genome mappings 
are required, our method will help to develop faster 
genome mapping software. 

Additional material 



Additional file 1: DNA sequences and their code words. All the 5 

mer DNA sequences and their code words on 4-ary (5,3)-perfect 
Hamming codes. 
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